
Elseer Radak
Gallente Bessemer Inc.
Posted - 2006.12.05 18:18:00 -
[1]
Lockheed19,
Large-scale systems use what is called a "distributed architecture": lots of small parts, each doing some of the work. Individually, each part is weak; working together, they are strong. The small parts are grouped into middle-sized units called nodes. In EVE, each node serves a given region (where the word "region" may mean a geographic area as defined by internet connection, or may describe how data is stored at the database end). Each node is in turn part of something bigger called a cluster. Several clusters make up the whole of the "game" we play as EVE.
Think of a spider web, a tapestry or a woven garment -- any of these under tension from a weight (of connections, data, whatever). Now one of the smaller servers fails (smaller being a relative term); it snaps, to use the metaphor. The next closest server (close being a relative, deep and nigh mystical term) will try to take up the job of bearing the "weight". But that server cannot take the strain, so it crashes as well. The "weight" goes up the chain of links, breaking everything in sight, essentially ripping the garment or crashing the "node" (collection of servers).
This series of events is called a cascade failure. It's a horrible thing to have occur. Within a short time all the servers have crashed (the woven item has come undone, leaving a pile of torn threads), and it's not clear what happened or how to fix it.
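For anyone who prefers code to metaphors, here is a toy sketch of the cascade described above. It is purely illustrative and says nothing about how CCP's servers actually work: the function name `simulate_cascade`, the load numbers and the single shared `capacity` are all made up, and I am assuming a failed server hands its entire load to the next live server in a ring.

```python
def simulate_cascade(loads, capacity, first_failure=0):
    """Return the order in which servers fail (toy model, hypothetical units).

    loads         -- load carried by each server before anything breaks
    capacity      -- the most load any single server can carry (assumed equal)
    first_failure -- index of the server that snaps first
    """
    loads = list(loads)              # copy so the caller's list is untouched
    n = len(loads)
    alive = [True] * n
    failed = []
    current = first_failure
    while True:
        # The current server fails and drops everything it was carrying.
        alive[current] = False
        failed.append(current)
        orphaned = loads[current]
        loads[current] = 0

        # Find the "closest" surviving server, walking around the ring.
        nxt = None
        for step in range(1, n):
            candidate = (current + step) % n
            if alive[candidate]:
                nxt = candidate
                break
        if nxt is None:
            break                    # nothing left standing

        # The neighbour tries to bear the extra weight.
        loads[nxt] += orphaned
        if loads[nxt] <= capacity:
            break                    # it holds, so the cascade stops here
        current = nxt                # it snaps too, and the weight moves on
    return failed


busy = simulate_cascade([80, 80, 80, 80], capacity=100)
quiet = simulate_cascade([30, 30, 30, 30], capacity=100)
print(busy)
print(quiet)
```

With every server already at 80 of 100, one failure takes down the whole ring; at 30 of 100 the neighbour absorbs the extra load and nothing else breaks. The headroom per server is what decides whether a single failure stays a single failure.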
In the end, all that anyone can do in the short term is adjust some parameters to reduce the weight (number of connections, fleet battles, objects in space, whatever) or adjust the fault tolerance (permitting the system to go slower). In the medium term, Production, QA and Dev will analyze the failures and determine the next best steps. After that analysis is complete, Dev will come up with a fix, QA will do their best to test it, and Production will deploy it to the users. Since the system of "EVE" is so big, QA cannot easily test it under real load conditions.
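The "reduce the weight" knob can be as simple as a cap on how much work a node will accept. A minimal sketch, assuming a hypothetical `Node` class with a `max_connections` setting (my own names for illustration, not anything from EVE):

```python
class Node:
    """A hypothetical node that refuses work beyond a configurable cap."""

    def __init__(self, max_connections):
        self.max_connections = max_connections  # the tunable parameter
        self.connections = 0

    def try_connect(self):
        """Accept a connection if there is headroom, otherwise turn it away."""
        if self.connections >= self.max_connections:
            return False                        # shed load instead of crashing
        self.connections += 1
        return True


node = Node(max_connections=3)
results = [node.try_connect() for _ in range(5)]
print(results)
```

The first three attempts are accepted and the last two are refused. Turning players away is unpleasant, but a node that says "no" stays up, while one that accepts everything may be the thread that snaps.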
So, in the end, EVE is paying for being so big. To me (as a network QA person), the EVE folks are doing their best to make this work out. They (CCP) are breaking new ground, and I would dearly love to read how they test all this, and what works and what does not work in terms of testing.
Does that help?
Elseer R.
--
Reality: A consentual experiance